This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Which chemical properties influence the quality of red wines?
Now that our packages are loaded, let’s read in and take a peek at the data.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
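The code behind the listing above is not shown. The following is a minimal sketch of the read-and-inspect step, not the original chunk. To stay self-contained it writes the first three rows printed above to a temporary CSV instead of assuming a path to the real file, so `str()` reports 3 observations rather than 1,599. The names `sample_rows` and `csv_path` are illustrative.

```r
# Illustrative stand-in for the real file: the first three rows shown by str().
# (density appears rounded in that listing, so these values are approximate.)
sample_rows <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8),
  volatile.acidity     = c(0.70, 0.88, 0.76),
  citric.acid          = c(0.00, 0.00, 0.04),
  residual.sugar       = c(1.9, 2.6, 2.3),
  chlorides            = c(0.076, 0.098, 0.092),
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54),
  density              = c(0.998, 0.997, 0.997),
  pH                   = c(3.51, 3.20, 3.26),
  sulphates            = c(0.56, 0.68, 0.65),
  alcohol              = c(9.4, 9.8, 9.8),
  quality              = c(5L, 5L, 5L)
)

# Write to a temporary file; the row names become an unnamed first column.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path)

# read.csv() gives that unnamed column the name "X".
wine <- read.csv(csv_path)

str(wine)      # one line per variable: name, type, first values
summary(wine)  # min, quartiles, mean and max for each column
```

An unnamed row-index column in the CSV is read in as `X`, which is presumably where the `X` variable in the listing comes from; it is an index, not a chemical property.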
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
12 - quality (score between 0 and 10)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
First, let’s plot each variable on its own to get a sense of its distribution and to check for outliers.
In the plot of quality, scores of 5 and 6 appear most frequently. The summary also shows that the mean quality is about 5.6.
In the summary, the minimum and maximum of density did not look far apart, so I plotted it to see what kind of distribution it has.
The plot confirms that density is centered on a mean of about 0.997 and looks approximately normal.
Let’s look at the effect of the binwidth on the x axis in more detail.
With a smaller binwidth, I can see a bin near 0.99 with a much higher count than its neighbors.
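As a rough sketch of the binwidth comparison, assuming the histograms were drawn with ggplot2 (the `stat_bin()` message later in this report suggests so). The data here are simulated, not the wine data: the mean follows the summary above, while the standard deviation and both binwidth values are assumptions chosen for illustration.

```r
library(ggplot2)

# Simulated stand-in for wine$density. The mean matches the summary above;
# the standard deviation (0.002) is an assumption for illustration only.
set.seed(1)
wine <- data.frame(density = rnorm(1599, mean = 0.9967, sd = 0.002))

# Wider bins show the smooth overall shape.
p_coarse <- ggplot(wine, aes(x = density)) +
  geom_histogram(binwidth = 0.001)

# Narrower bins expose local spikes that wider bins average away.
p_fine <- ggplot(wine, aes(x = density)) +
  geom_histogram(binwidth = 0.0002)

print(p_coarse)
print(p_fine)
```

Setting `binwidth` explicitly also avoids relying on the default of 30 bins.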
Outliers can be seen between 1.2 and 1.6.
chlorides stands out for the large gap between its minimum (0.012) and maximum (0.611). Comparing the mean (0.087) with the median (0.079) suggests that it may include outliers.
The box plot shows the outliers explicitly.
Let’s narrow the range further.
In these plots, most values fall between 0.06 and 0.09.
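To make the outlier check concrete, here is a small worked sketch using only the chlorides quartiles printed in the summary above. It applies the conventional 1.5 × IQR rule that box plots use to flag points; the variable names are illustrative, and the original report may have relied on the plot alone.

```r
# Quartiles of chlorides, copied from the summary() output above.
q1 <- 0.070
q3 <- 0.090
iqr <- q3 - q1

# 1.5 * IQR rule: points beyond these fences are drawn as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Extremes of chlorides, also from the summary() output.
chlorides_min <- 0.012
chlorides_max <- 0.611

# Both comparisons are TRUE: each extreme lies outside its fence.
print(chlorides_min < lower_fence)
print(chlorides_max > upper_fence)
```

The fences work out to 0.04 and 0.12, so both the minimum and the maximum would be flagged, and the 0.06 to 0.09 range sits well inside them.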
So far I have explored one variable at a time. Interestingly, there are few low-quality wines, whereas wines with a relatively high score of 7 are fairly common.
I also found a variable with outliers. These outliers need to be handled before building a predictive model, for example with machine learning.
First, I use corrplot to check the correlations. The plot shows that pH has a strong negative correlation with fixed acidity.
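A minimal sketch of the correlation step follows. It uses only the first ten rows of two columns, as printed in the `str()` output earlier, so it shows the mechanics rather than a reliable estimate; on the full data the call would presumably be `cor(wine)`. The `corrplot` call is guarded because the package may not be installed, and `method = "circle"` is an assumed choice.

```r
# First ten rows of two columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix of the numeric columns.
cor_mat <- cor(wine_head)
print(round(cor_mat, 2))

# Draw the correlation plot only if the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

Even in this small sample the coefficient between fixed.acidity and pH is negative, which is the direction expected chemically: more acid means a lower pH.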
Building on this plot, I can add quality as a third variable alongside a pair of variables. With every quality level shown at once, however, the plot is hard to read, so I split quality into two categories before visualizing it.
I will create a new feature based on quality: a quality of 6 or more is a good wine, and a quality of less than 5 is a bad wine.
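The new feature could be built along these lines. This is a sketch under stated assumptions, not the original code: the text does not say how a score of exactly 5 is treated, so it is left as `NA` here, and the name `rating` is hypothetical.

```r
# Hypothetical quality scores covering the range seen in the summary (3 to 8).
quality <- 3:8

# Thresholds as stated in the text: 6 or more is "good", less than 5 is "bad".
# A score of exactly 5 is not covered by either rule, so this sketch leaves
# it as NA; the original code may well have assigned it to one of the classes.
rating <- ifelse(quality >= 6, "good",
                 ifelse(quality < 5, "bad", NA))
rating <- factor(rating, levels = c("bad", "good"))

print(data.frame(quality, rating))
```

Because 5 is one of the two most common scores, how it is assigned has a large effect on the size of each class.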
Poor-quality wines seem to have higher density when fixed acidity is low.
When volatile.acidity is low and density is high, the wine tends to be bad, and when volatile.acidity is high, it seems to tend to be good. Since this plot shows a noticeable trend, let’s look at the relationship between volatile.acidity and quality in more detail.
The correlation with quality is not very strong, but it is visible.
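One way to put numbers on this relationship is sketched below, again using only the first ten rows printed in the `str()` output, so the results are illustrative and should not be read as findings for the full data set. The choice of a Spearman (rank) correlation is an assumption; the original analysis may have used Pearson.

```r
# First ten rows of two columns, copied from the str() output above.
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean volatile acidity within each quality score.
means <- tapply(volatile.acidity, quality, mean)
print(round(means, 3))

# Rank-based correlation, which suits an ordinal score such as quality.
rho <- cor(volatile.acidity, quality, method = "spearman")
print(rho)
```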
If citric acid is low and density is high, the wine tends to be good; in the opposite case, it tends to be bad.
If alcohol is low and density is high, the wine tends to be good; in the opposite case, it tends to be bad.
I plotted the relationship between quality and total.sulfur.dioxide. Outliers are visible at quality 7.
A weak correlation is visible at most quality levels. In particular, quality levels 3 and 4 seem to show a stronger correlation than the others.
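Comparing correlations across quality levels can be sketched by splitting the data on quality and computing one coefficient per group. The data below are simulated, and the choice of alcohol and volatile.acidity as the pair is an assumption made for illustration; only the ranges follow the summary above.

```r
set.seed(42)
n <- 300

# Simulated stand-in data; the values and the built-in negative slope are
# made up, and the variable pairing is an assumption.
wine <- data.frame(
  quality = sample(3:8, n, replace = TRUE),
  alcohol = runif(n, 8.4, 14.9)
)
wine$volatile.acidity <- 0.9 - 0.03 * wine$alcohol + rnorm(n, sd = 0.15)

# One Pearson coefficient per quality level.
by_quality <- sapply(split(wine, wine$quality), function(d) {
  cor(d$alcohol, d$volatile.acidity)
})
print(round(by_quality, 2))
```

On the real data the extreme quality levels contain few wines, so their coefficients would be noisier than those for 5 and 6.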
A negative correlation can be seen for quality levels 4 to 7. Interestingly, at quality 3, volatile.acidity in the alcohol range 10.5 to 12 is higher than at quality 8, and even at quality 8 it is high in the alcohol range 12 to 16.
## Discussion
First we checked the correlations between pairs of variables. However, even when two variables are only weakly correlated, looking at three variables together reveals various influences. Plotting four variables at once is difficult in practice, but it is worth remembering that the underlying relationships are complex.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The EDA gave the distribution of wine quality shown above. Most wines have a quality of 5 or 6; scores of 7 and 8 indicate relatively good wines, and scores of 3 and 4 indicate bad wines.
Looking at the correlation of all the variables, we can see that there is a relatively high correlation between alcohol level and volatile.acidity.
The multivariate plot brings out the difference between good and bad wines clearly.
This plot visualizes density, alcohol level, and quality at once. When both density and alcohol level are high, a wine is more likely to be good.
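A plot of this kind could be produced roughly as follows, assuming ggplot2. The data are simulated within the ranges from the summary above, so the picture will not reproduce the pattern described; the `rating` column, its labels, and the axis labels are illustrative choices.

```r
library(ggplot2)

set.seed(7)
n <- 200

# Simulated stand-in data: ranges follow the summary above, but the values
# are random and carry no real relationship.
wine <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  density = runif(n, 0.9901, 1.0037),
  quality = sample(3:8, n, replace = TRUE)
)

# Two-level label used only for colouring the points (assumed cut at 6).
wine$rating <- factor(ifelse(wine$quality >= 6, "good", "not good"))

# Two numeric variables on the axes, quality category as colour.
p <- ggplot(wine, aes(x = alcohol, y = density, colour = rating)) +
  geom_point(alpha = 0.6) +
  labs(x = "Alcohol (%)", y = "Density", colour = "Rating")

print(p)
```

Mapping the category to colour keeps the plot two-dimensional while still showing a third variable.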
The most difficult part was understanding how several variables together relate to quality, rather than just pairs of variables. The relationship between two variables is comparatively easy to judge by checking their correlation, but real relationships are often more complicated than that. We therefore investigated relationships among three variables this time. This investigation showed that even variables with low values in the correlation plot can show a relationship once a third variable is considered.